Nicola De Cristofaro (Matr. 0522500876) Cloud Computing Curriculum

Hospital Readmission

What do we mean by readmission? We mean an admission to the hospital of a patient who has already been hospitalized within a certain period of time.

1. Problem Statement

The goal is to create a model that predicts which patients are at risk of unplanned readmission within 30, 90 or 365 days, using patient demographics, diagnoses and ICU stays.

2. Type of model used for prediction

"Re-Admission" is a categorical attribute (YES,NO), so a classification model is used.

3. Metrics used for validation

For validation we use the metrics derived from the confusion matrix. The structure of this kind of matrix for a binary classifier is the following:

[Figure: confusion matrix]

There are four important terms in this matrix: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

From these terms we can compute the metrics used to validate the model.
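As a minimal sketch, the standard metrics derived from a binary confusion matrix can be computed from the four counts as shown here. The function name `confusion_metrics` and the example counts are illustrative assumptions, not values from this study.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive common validation metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts, used only to exercise the function.
metrics = confusion_metrics(tp=30, tn=50, fp=10, fn=10)
print(metrics)
```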

We start by importing our baseline dataset, extracted by selecting only the necessary tables from the MIMIC dataset.

Since we want to predict a readmission, we have to get the next admission date for each patient, if it exists.

Since we want to predict UNPLANNED re-admissions, we should filter out the ELECTIVE (planned) next admissions.

Now we calculate days until next admission, because we want to predict unplanned re-admission within a specific range of days (30, 90, 365).
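The three steps above (next admission date, removal of ELECTIVE next admissions, days until next admission) could be sketched in pandas as follows. The toy table, the uppercase column names (`SUBJECT_ID`, `ADMITTIME`, `DISCHTIME`, `ADMISSION_TYPE`) and the back-fill strategy are assumptions made for illustration, not necessarily the exact procedure used in this work.

```python
import pandas as pd

# Toy admissions table; column names are assumed, modeled on the MIMIC admissions table.
adm = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 1, 2],
    "ADMITTIME": pd.to_datetime(
        ["2150-01-01", "2150-01-10", "2150-01-25", "2150-03-01"]),
    "DISCHTIME": pd.to_datetime(
        ["2150-01-05", "2150-01-12", "2150-01-30", "2150-03-04"]),
    "ADMISSION_TYPE": ["EMERGENCY", "ELECTIVE", "EMERGENCY", "URGENT"],
})

# Order each patient's admissions chronologically.
adm = adm.sort_values(["SUBJECT_ID", "ADMITTIME"]).reset_index(drop=True)

# Next admission time and type for the same patient (missing if there is none).
grouped = adm.groupby("SUBJECT_ID")
adm["NEXT_ADMITTIME"] = grouped["ADMITTIME"].shift(-1)
adm["NEXT_ADMISSION_TYPE"] = grouped["ADMISSION_TYPE"].shift(-1)

# A planned (ELECTIVE) next admission does not count as a readmission:
# blank it out, then back-fill with the following unplanned admission, if any.
elective = adm["NEXT_ADMISSION_TYPE"] == "ELECTIVE"
adm.loc[elective, "NEXT_ADMITTIME"] = pd.NaT
adm["NEXT_ADMITTIME"] = adm.groupby("SUBJECT_ID")["NEXT_ADMITTIME"].bfill()

# Days between this discharge and the next unplanned admission.
adm["DAYS_UNTIL_NEXT_ADMISSION"] = (
    (adm["NEXT_ADMITTIME"] - adm["DISCHTIME"]).dt.total_seconds() / 86400
)
print(adm[["SUBJECT_ID", "DISCHTIME", "NEXT_ADMITTIME", "DAYS_UNTIL_NEXT_ADMISSION"]])
```

Admissions with no later unplanned admission keep a missing value, which can later be treated as "not readmitted".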

Ethnicity attribute correlation

Note that "ASIAN" patients have the lowest median number of days until the next admission.

ADMISSION_TYPE attribute

As we could expect, we see a larger median of days until the next admission for admissions in which the patient had a surgical operation on the same day. This is expected because, after a surgical operation, a patient is likely to be readmitted within a short period of time, usually for observation.

Age attribute

As we did for our previous tasks, because of the discrete-like distribution of data on the extremes of age, it could be useful to convert all ages into the categories of newborn, young adult, middle adult, and senior for use in the prediction model.
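A minimal sketch of this binning with `pandas.cut` is shown below. The category names come from the text, but the bin edges and the column name `AGE_CATEGORY` are hypothetical choices for illustration; the actual boundaries used in this work may differ.

```python
import pandas as pd

ages = pd.Series([0, 19, 35, 50, 72, 300], name="AGE")

# Hypothetical bin edges: (-1, 13], (13, 36], (36, 56], (56, 400].
# The upper edge is simply large enough to cover every age value in the data.
bins = [-1, 13, 36, 56, 400]
labels = ["newborn", "young_adult", "middle_adult", "senior"]

age_category = pd.cut(ages, bins=bins, labels=labels).rename("AGE_CATEGORY")
print(pd.concat([ages, age_category], axis=1))

# One-hot encode the categories so a linear model can use them as features.
age_features = pd.get_dummies(age_category, prefix="AGE")
```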

Now, let's analyze the diagnoses in relation to the number of days until the next admission.

We notice that:

The data in the ICUSTAYS table could be useful because it indicates whether a patient was in an ICU (Intensive Care Unit) during an admission. This could be a factor that increases the likelihood that a patient is readmitted to the hospital within a certain period of time.
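One simple way to turn this information into a feature is a binary flag per admission, sketched below. The join key `HADM_ID`, the toy tables and the flag name `ICU` are assumptions for illustration.

```python
import pandas as pd

# Toy tables; HADM_ID (hospital admission id) is assumed to link both tables.
admissions = pd.DataFrame({"HADM_ID": [100, 101, 102]})
icustays = pd.DataFrame({"HADM_ID": [100, 100, 102], "ICUSTAY_ID": [1, 2, 3]})

# 1 if the admission has at least one ICU stay, 0 otherwise.
admissions["ICU"] = admissions["HADM_ID"].isin(icustays["HADM_ID"]).astype(int)
print(admissions)
```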

5. Data cleaning

Compute the OUTPUT LABELs to predict
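As a sketch, the three binary labels can be derived from the days until the next unplanned admission. The column name `DAYS_UNTIL_NEXT_ADMISSION`, the toy values and the inclusive comparison (`<=`) are assumptions.

```python
import pandas as pd

# Toy values; NaN means the patient has no later unplanned admission.
df = pd.DataFrame(
    {"DAYS_UNTIL_NEXT_ADMISSION": [5.0, 45.0, 200.0, 400.0, float("nan")]}
)

for window in (30, 90, 365):
    # Comparisons with NaN evaluate to False, so "no next admission" becomes 0.
    df[f"READMISSION_{window}"] = (
        df["DAYS_UNTIL_NEXT_ADMISSION"] <= window
    ).astype(int)

print(df)
```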

The final DataFrame has 43 feature columns and 1 target column (READMISSION_30, READMISSION_90 or READMISSION_365, alternatively), with an entry count of 160543.

6. Prediction Model

We use a supervised learning ML model. First of all, what is it? Supervised learning is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. It uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.

Why do we choose it? Because in our case we have the correct output for each dataset entry, "READMISSION_30" (Yes or No, 1 or 0), and we want to create a model that predicts this output for new entries; in other words, a model that generalizes well.

We will implement the supervised learning prediction model using the Scikit-Learn machine learning library.

To implement the prediction model, our dataset is split into training and test sets at an 80:20 ratio using the scikit-learn train_test_split function.
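A minimal sketch of this 80:20 split on synthetic data follows. The `random_state` and the use of `stratify` (which keeps the class ratio equal in both sets) are assumptions, not settings confirmed by this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the READMISSION_30 label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# test_size=0.2 gives the 80:20 ratio described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```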

Why split into a training and a test set? Because to assess the behavior of a machine learning model, we need observations that were not used in the training process; otherwise, the evaluation of the model would be biased. When we build a predictive model, we want it to work well on data it has never seen. That is why we use a training set to train the model and a test set to evaluate its accuracy.

Searching the Internet for the best train-test ratio, the first answer is 80:20: we use 80% of the observations for training and the rest for testing. This is the approach taken here.

Logistic regression, despite its name, is a linear model for classification rather than regression. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
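To make the role of the logistic function concrete, the sketch below computes the predicted probability of the positive class from a linear combination of features. The weights, bias and feature values are arbitrary illustrative numbers, and `predict_proba` here is a hypothetical helper, not the scikit-learn method.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of class 1 under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Arbitrary example: the two feature contributions cancel out, so z = 0.
p = predict_proba(weights=[1.0, -1.0], bias=0.0, features=[2.0, 2.0])
print(p)
```

A probability above a chosen threshold (commonly 0.5) is then mapped to the positive class.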

7. Model Evaluation & Parameter Tuning

Calculate Performance metrics
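With scikit-learn, the metrics can be obtained directly from the true labels and the predicted probabilities, as in this sketch. The labels and probabilities are made-up toy values, and the 0.5 decision threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy ground truth and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.9]

# Turn probabilities into hard predictions with an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# AUC is computed from the probabilities, not from the thresholded predictions.
auc = roc_auc_score(y_true, y_prob)

print(tn, fp, fn, tp, accuracy, precision, recall, auc)
```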

Parameter Tuning

Every decision involves a trade-off among the metrics described above. Let's choose AUC (the area under the ROC curve) for this task, as it balances the false positive rate (FPR) and the true positive rate (TPR).

Let's execute the prediction again with the tuned LogisticRegression hyperparameter C = 0.003, which maximizes our performance metric, AUC.

We can see that, with the tuned parameter, the AUC score improved.
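One way such a value of C can be found is a simple search over candidate values, keeping the one with the highest AUC on held-out data. The sketch below uses synthetic data and a hypothetical candidate grid, so the selected value will generally differ from the one reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid of inverse regularization strengths (smaller C = stronger penalty).
candidate_cs = [0.0001, 0.001, 0.003, 0.01, 0.1, 1.0]

auc_by_c = {}
for c in candidate_cs:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train, y_train)
    # Score with the probability of the positive class.
    valid_prob = model.predict_proba(X_valid)[:, 1]
    auc_by_c[c] = roc_auc_score(y_valid, valid_prob)

best_c = max(auc_by_c, key=auc_by_c.get)
print(best_c, auc_by_c[best_c])
```

Selecting C on a validation split rather than on the test set keeps the final test score an unbiased estimate.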

8. Result Discussion

In the previous section we already saw the performance metrics of the model. Now let's see which features were most important in predicting 30-day hospital readmission when using the logistic regression classifier.
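For a logistic regression model, a common proxy for feature importance is the magnitude of the fitted coefficients, which is meaningful when the features are on comparable scales (for example, one-hot indicators). The sketch below assumes this approach; the feature names and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical binary indicator features.
feature_names = ["AGE_newborn", "neoplasms", "ICU"]
X = rng.integers(0, 2, size=(400, 3)).astype(float)

# Synthetic label driven mostly by the first feature and weakly by the second.
logits = 3.0 * X[:, 0] + 0.5 * X[:, 1] - 1.5
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

# Rank features by absolute coefficient; the sign gives the direction of the effect.
coefs = pd.Series(model.coef_[0], index=feature_names)
importance = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(importance)
```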

30 DAYS-READMISSION result: the result of our study is that "Newborn patients" are more likely to be readmitted within 30 days of the last discharge from hospital. This could be expected for newborn patients, since they are usually readmitted for observation purposes.

Readmission_90, Readmission_365

Now we make the same prediction with "LogisticRegression" and the same parameters, but this time we want to predict readmission to the hospital within 90 and 365 days of the last discharge.

90 DAYS-READMISSION result: the result is similar to the 30-day one, so in this case too "Newborn patients" are more likely to be readmitted within 90 days of the last discharge from hospital. However, the feature "neoplasm" also has a fairly high importance in the prediction. This means that, in addition to newborn patients, patients diagnosed in the category "neoplasm" are also more likely to be readmitted to the hospital within 90 days of discharge.

Finally, let's check with the 365-day time limit.

365 DAYS-READMISSION result: in this task's result the feature "neoplasm" has the highest importance in the prediction. It is followed by the feature for admissions to the hospital for observation. So we could say that patients diagnosed in the category "neoplasm" and patients admitted for observation purposes are more likely to be readmitted to the hospital within 365 days of discharge.

Conclusions for Hospital Readmission

We saw how the probability of readmission to the hospital after discharge changes when we consider different periods of time. Most importantly, we saw that in all three time intervals considered (30, 90 and 365 days after discharge), patients diagnosed in the category "neoplasm" always have a high probability of being readmitted, especially within 365 days.

With this kind of insight it is possible to plan and better manage admissions and hospital stays, avoiding crowding and consequently the risk of acquiring infections in the hospital.